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Abstract. - Statistical physics is used to investigate independent component analysis with 
polynomial contrast functions. While the replica method fails, an adapted cavity approach 
yields valid results. The learning curves, obtained in a suitable thermodynamic limit, display 
a first order phase transition from poor to perfect generalization. 



During the last decade, independent component analysis (ICA) has emerged as one of 
the most powerful unsupervised learning procedure for many signal processing tasks IDHl. It 
assumes that the observed, often high dimensional signal, is a linear mixture of independent 
source signals and aims to recover these sources just from observing the mixed up signal. 
Hence, ICA is sometimes also called blind signal deconvolution. An illustrative scenario is the 
cocktail party problem where, to understand any single speaker, we first need to identify her 
voice amidst the jumble of sounds reaching our ears. 

The basic finding in ICA is that the distribution of the observed signal will be similar to 
a Gaussian, especially when many independent sources contribute to the linear mixture. The 
source signals, however, will often be highly structured, and non-Gaussian. ICA thus searches 
for a linear transformation of the observations which maximizes non-Gaussianity by evaluating 
a suitable contrast function. To detect this, the contrast function used must compute a higher 
than quadratic statistics of the transformed data. 

In a principled way, ICA can be derived by considering the mutual information of the 
transformed data, which is a natural measure of statistical dependence. To avoid the problem 
of density estimation, which arises in a direct evaluation of the mutual information, one then 
uses expansions (Edgeworth, Gram-Charlier) around Gaussianity to approximate the mutual 
information PIE]- This leads to contrast functions which are related to the higher order 
cumulants of the transformed data. 

This Letter provides a first analysis of ICA for polynomial contrast functions using the 
statistical physics of disordered systems. Surprisingly, the replica method, one of the most 
powerful tools in analyzing quenched disorder, fails since it cannot control the contributions 
to the contrast function in the large deviations regime. However, a physically valid analysis is 
obtained by adapting the cavity method, showing that the scale of the learning curve depends 
on the degree of the polynomial. Unusually, for a system with continuous couplings, the curve 
itself is a step function, jumping from poor to perfect generalization. But a badly generalizing 
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state is always metastable and it is remarkable that we can nevertheless find polynomial time 
algorithms which generalize well. 

In formal terms, we assume that the observable signal ^ can be written as ^ = M^, where 
the source ^ is an iV-dimensional random variable with independent components and M is the 
N hy N mixing matrix. Learning is based on a training set D of P independent observations ^'^ 
of the signal ^, obtained for a fixed, if unknown, mixing matrix M . The deconvolution problem 
(finding ^) can be decomposed by first finding just one independent component, subtracting 
it from the mixture, and reapplying the procedure to the remaining A'^ — 1 dimensional task. 
Hence, I shall just deal with finding the first component and assume that it is non-Gaussian 
whereas all other components of ^ are Gaussian. 

Normally, the first step in IGA is to whiten the data, so that it has zero mean and its 
covariance matrix is the identity. So, I shall further assume that the source components have 
zero mean and unit variance and that M is orthogonal, Al'^M = 1. In short, the ICA task 
now is to find, based on the training set D, a vector J such that J^£^ = ±^i. For this, one 
picks a suitable non-quadratic contrast function g, computes the empirical contrast 

p 

cd(J)=P-^^5(J^M|'^), (1) 
u.=i 

and chooses J to maximize co{J) under the constraint | J| = 1. To analyze this problem, one 
will first consider the Gibbs weight exp{(3Nco{J)) at some finite inverse temperature /3 and 
calculate the typical value of the logarithm of its partition function Zo = J dJ exp{PNca{J)) , 
where the integration is over the uniform density on the unit sphere in M^. Since, via a gauge, 
the partition function is independent of the mixing matrix M, we set AI — 1 for the analysis. 

I shall first consider the replica approach to this calculation and for brevity assume that 
the contrast function is g{x) = x^. We are then immediately faced with the problem that the 
moments {Z^)^ do not exist, indeed Z-q does not even have a mean (^). A second issue arises 
since co{J) is 0{N^^^ /P) for J = So, if we have just P — aN examples, \nZa is not 

an extensive quantity for large N. 

To address the first problem, we introduce a cutoff K^j > 0, replacing 17(2;) = x'^ by 
gnix) = Tsiax.{x^ , K^} in Eq. Since we want to ultimately recover the g{x) = x^ case, we 
assume that i^T^ diverges with increasing N. Nevertheless, due to the cutoff, the moments of 
Zo now exist for any finite N. Further, we assume that the training set has P = aL^N and 
not just aN patterns. Then, if L^j diverges sufficiently quickly w.r.t. N and InZo will 
be an extensive quantity. Finally, we should find that for the purpose of calculating In Zjt for 
large N, choosing Kn ~ VN is equivalent to not cutting off at all. The reason for this quite 
simply is that for N ^ 00 the fields J"^^^ are bounded by for almost all training sets. 

In this setting, standard arguments yield the exact finite N result 

= AAr,„y"di?dQdet(Q-i?i?^)^^g„(i?,Q)^ 



C(Li\r , , 

Here R is an n- vector, Q a symmetric n hy n matrix with Q"" — 1, and the domain of 
integration is such that the matrix Q — RR^ is positive definite. The AT" are zero mean 

(^)In a sense, this problem already crops up for principal component analysis where g{x) = . Then (^p)p 
diverges, if n or /3 are large enough. So, using replicas, one is in effect computing a continuation from small /3 
and large n to large /3 and small n. 
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Gaussian with covariances (X'^X^) = Q"'' - R'^R'', and Ajv,„ is obtained using that the 
moments equal 1 for (3 = 0. Now, given any sequence of cutoffs Kj^, we can certainly find Lj^ 
so that Gn{R,Q) stays finite for large N. Then, we should be able to use Laplace's method 
of the maximum point to find that in the large N limit 

1 In = sup In Gn {R, Q) + ^ In det{Q-RR^) . (2) 

R,Q ^ 

But at this point, at the latest, it is clear that something is amiss. The limiting value of the 
above RHS depends only on the relative scalings of Kn and Ln and not on the relationship 
of these scalings to the system size N. So ^ implies that the scale of learning curve can 
be arbitrarily stretched by using cutoffs which diverge quickly with N. This problem arises 
regardless of assumptions about replica symmetry. 

We proceed anyway and, using the replica symmetric parameterization of (01, find for 
N ^ oo 

^(InZD)e = supinf Gr{q, R) + Gs{q,r) 

iV r 1 



Gr{q,R) = aLN (\n(cxp [ —J— 



N 



fHi + \/<i- r^yo + - <iyi)) ) ) 



Gs{q,r) = i^ + iln(l-g) (3) 

where yo,yi are standard Gaussians, i.e. with zero mean, unit variance. The extremal r is 
just the typical value of the first component of a weight vector picked from the Gibbs density 
and measures to which extent the structure in the data is recognized. Using (jS)), we relate the 
scahngs of and L„. For Ljv ^ the energy term converges to Gr{q, R) = r^ (Ci)- This 
is the limit of many examples where r = 1 for all a. In contrast, for L„ <C K„ there are too 
few examples and Gr{q,R) diverges. 

So, the scale of the learning curve is given by setting Ljv = ^n- On this scale, we find that 
Gr{q,R) converges to as in the limit of many examples if q exceeds a critical value 

qc{a,(3), whereas Gr{q,R) diverges for q < qc{a,f3). Solving the extremal problem for q by 
taking the limit q — > qc{a,(3) from above, then taking the /? — > oo limit, we finally find the 
simple result for the ground state: c{a) = sup^ r^ i^i)^ + (1 ~ r^)/a. Here c{a) is the typical 
value of the highest achievable empirical contrast, max| cn{J). The learning curve for r 
thus obtained, is a step function showing a first order phase transition at etc = 1/ (^i)^! fro™ 
no learning (r = 0) to perfect learning (r = 1). But the r = state is metastable for all values 
a > ac- 

The replica theory predicts that for any divergent sequence of cutoffs K^, e.g. Kj^ = e^, we 
need P > acKj^N examples for good generalization when is large. While this is ridiculous, I 
have argued above that choosing K,„ = ^/N is, for N oo, equivalent to not cutting off at all. 
To compare the replica result for this choice of _K„ to numerical simulations, let us consider 
actually finding a weight vector maximizing coiJ)- It turns out that a rather simple discrete 
dynamics can be used since g{x) = x^. Starting with a random vector of unit length J*', at 

the k-th time step we first compute the matrix A{j'^) = X]^=i £,^{J'' and then choose 

J*^"*"^ to maximize \J'^A{J'')J\ under the constraint |J| = 1. So, J'^+i is an eigenvector to 
the eigenvalue of largest magnitude of A( j'^). Standard results on quadratic forms imply that 
17^^+^ A{J'')J''^^\ > \J'^ A{J''~^)J''\, and the inequality is strict unless we are at a fixed 
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Fig. 1 - Prediction of Kw = \/N replica theory (bold line) compared to simulation results. The non 
Gaussian source is ii = [y'^ - 1)/V^, where y is a standard Gaussian. The empty symbols show the 
results for the algorithm finding local maxima of the empirical contrast. The full symbols, denoting 
results for the iterated version of the procedure described in the main text, show that the agreement 
with the replica theory improves quickly with increasing system size A'^ for this algorithm. The error 
bars estimate the standard deviation of the sample to sample fluctuations. 



point. Hence, the iteration converges to a vector J°° which is a local maximum or minimum 
of cd(J). In the latter case, wc just flip the sign of J°° to obtain a local maximum. 

Simulation results for the procedure, compared to the K,^ = \fN replica theory in Fig. 1, 
show that the performance of the algorithm is rather poor. This is in line with the theoretical 
findings, since these predict that r = is mctastablc, and the algorithm is only finding a local 
maximum. Figure 1 also shows result for an iterated version of the algorithm. There the 
algorithm is rerun with m = O.IN different random initial conditions, and the weight vector 
maximizing co{J) among the m outcomes is chosen. These result are in good agreement 
with the = \/N replica theory, indicating that beyond the phase transition the basin of 
attraction of the global maximum is quite large. 

Even if the simulations indicate that the replica approach is saved by in the end plugging 
in the correct scaling of the cutoff isTjv, the theoretical situation is highly unsatisfactory. I 
shall next show that a physically reasonable analysis can be provided by adapting the cavity 
method. This is much simplified if make some major changes to the notation. From now 
on the non- Gaussian source will be denoted by 7, whereas all of the N components of ^ are 
assumed independent standard Gaussian. Our primary goal is to calculate the typical value 
of Cr = max| Cr{J) with 



Cr{J) = p E airY + Vi^^J^n (4) 

,1=1 

where J is an N-dimensional vector. So Cr is the maximal value of the empirical contrast 
achievable on an r-shell. For generality, we shall now longer assume that g{x) must be cubic 

but consider any super-quadratic function which docs not diverge too quickly. In particular, for 
some fc > 0, \\m.x^rx> g{x) / x^^'' = ip should exist and be positive. Without loss of generality, 
we may then assume ip = 1. 



R. Urbanczik: Statistical physics of independent component analysis 



5 



We still have P = aL^N examples and consider the random variable Jd with the Gibbs 
density 



5(7^ [jfe) - gin' + vi^iJf^n • (5) 

Here [J] — J/\J\ and Zo{/3) is given by the normalization JdJpo{J) = 1. Note, that we 
are now using a factorizing Gaussian prior on J and, to compensate for this, the normalized 
vector [J] is used to calculate the field in Isj). 

A key task in the cavity approach is obtain the field distribution by calculating the thermal 
average ((/)(7'^, [Jd]'^ S,'^)) for any function 0. One finds 

where Jjj, is the random variable with the Gibbs density obtained when pattern fi is removed 
from the system, i.e. omitting the /i-th factor of the product in ^ and adjusting the partition 
function to Zd/^{I3). The variance of the cavity field [Jd//j]"^C is a self averaging quantity and 
it must then equal 1 — q for large TV, where q = \ ([Jd//j]) Normally, one would further 

argue that [Jn/^^\^^' becomes Gaussian in the thermodynamic limit. But if we assume this, 
the Jd/^ average in © diverges even when </) is a simple bounded function. This highlights 
the fact that the cavity field is not Gaussian in the large deviations regime because [Jn/f^^S,' 
cannot be larger than |^'^|. 

Hence, I rephrase the cavity argument as follows: For the purpose of calculating overlaps 
with a random vector such as the not normalized Jo/^ can for large N be treated as a 
Gaussian (with covariance matrix (1 — q)!). Then, the fluctuations of the cavity fleld obtained 
using the normalized [ /^] , 



can be explicitly calculated. This yields the important fact that there are just two relevant 
scales for the cavity fluctuations. For large Pfq^q{h) converges to e~^'^ /(i-9)/.^27r(l — q) 
if ft, <C a/ZV, but in the large deviations regime, for h = d^/N, 

lim Ar-ilnPjv(dViV) = -i^^ + iln(l-d2) (7) 
N^oo 2 1 — q 2 

if < < 1 . Now, in terms of the functional 



-^g(l,^y+h) 



-Vn 

the average in Eq. jnj can in the limit of large N be rewritten as ((/)(7'', [>/r>]"^^'')) 
Cy!^^^^L{(t))/ Cyf .^p.{l) with ~ q^^ {[Jo/^l\)^J^^ ■ So the quenched averages are 



(lnZD(/3)-lnZ„/^(/3))„ = (ln£«;^(l))_ (9) 
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where y is standard Gaussian. The last equation is obtained by setting = 1 in © . 

We can now consider whether the large deviations regime contributes to the averages in 
for a polynomially bounded 0. Using that for large arguments g{x) ~ x^^^ and referring 

to Eq. 0, we find that it will contribute if the maximum of 

is positive for large N. This won't happen if L„ ^ N^^ and Eq. then imphes that 

{ct>{"i^-, [JdVS,^)) , I ~ {4>{1tV)),, -V- The empirical mean equals the expectation value and 

SO the learning curve is trivial. Henceforth, we focus on the relevant scale, setting L„ — N'l^ . 

Our next task is to calculate the response when a new coupling Jq is added to the system 
and each pattern is augmented by a new component . We denote the augmented training 
set by D and use (|3J| to define the partition function Zf^{(3) of the TV + 1 dimensional sys- 

tem. Due to the A^-dependence of the Gibbs weight ' , it is simplest to assume 

a slightly different temperature Pn — /3L„+i/L„ in the augmented system. Then, when con- 
sidering the ratio Z]jj(/3jv)/ZD(/3), the two systems have the same Gibbs weight per pattern. 
Standard arguments jS] thus apply and yield that (lnZjj(/37v)/-^D(/3))]ij — Gs{QtO) for large 
A^. Here Gs{q,0) is the entropy term of the replica theory (Eq. O, but evaluated at r = 
because we are calculating the partition function for each r-shell individually. 

Having identified, via = N^'^^ the scale of the learning curve, (lnZD(/3))^ will 
converge to a finite quantity z{a, (3) in the thermodynamic limit. We then have 



k + 2 d (3k d 

(In Z^{I3n)/Zo{(3))^ = z{a,P) - a^——z{a,P) + — — z{a,(3). 

The derivative of z with respect to a is obtained from Eq. , and the thermal derivative is 
found from JHJ using (f> = g. 

Putting things together, we finally find for large N 

z{a,P) = L^^NhHnClf,{l)-^^0^\ -fG.(,,0), (11) 

where the value of q still has to be determined. 

For this, let us reconsider when the large deviations regime contributes to the value of 
£'^'^{1). Going back to Eq. H1U|I . with Ljv = N^''^ we see that as in the replica theory this 
is governed by a critical value qdP) of q. For q < qdP), Taayidu{d) is positive in the large 
N Hmit, so lfTT|l diverges. The possible range for q is thus qc{(3) < q < I- But, if we assume 
q > qdP), the large N limit yields the very simple result z{a,(3) = Gs{q) + c>:f3 {g{'-f,y))^ y. 
Now, on one hand, the empirical contrast is found by differentiating z{a, (3) w.r.t to (3. This 
yields (5(7, v)) ^^y + ^G's{l)§p<l- But computing the same quantity using © yields (3(7, y))^^y 
So q must stay constant when (3 varies, but this is impossible since gc(/3) 1 for /3 — > 00. 

Hence, the only possible value for q is qc{(3)- Evaluating Hll|) by taking the limit q qc{(3) 
from above, leads to the same result as in the K^^ = ^/N replica theory. But, of course, this 
has the same inconsistencies as found for the q > qc {(3) assumption. It also makes no physical 
sense to use Hll|l at the point of discontinuity since the cavity derivation neglects fluctuations 
of q. Even if these vanish with increasing iV, at the point of discontinuity, q = qdP), the true 
result will nevertheless depend on the unknown fluctuations. 
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But some conclusions can be drawn, knowing that q has the critical value. Let dfs be the 
unique positive value such that u{dfj) — for q — qdP). Then arguments analogous to the 
derivation of show that the probability of the posterior field [Jd]C exceeding d^/N is not 
exponentially small if d is lower than dp . More precisely, one finds for N oo and d < dp 

{N-^\nCfjQ{h-d^))/Cfjl)) = 0. 

Further, dp approaches 1 with increasing (3. But this is only possible if simply aligning the 
weight vector with the pattern maximizes the empirical contrast, at least upto sub-extensive 
corrections. So, in the notation of Eq. 01 we have C,- — Cr{[£,^]) for large iV, and thus finally 

a = (1 - r^ f-^/a + (gin +^/l-r^ y)) . (12) 

Maximizing this in r, the same learning curve is obtained for the cubic case, g{x) = a;^, as in 
the -fCv = replica theory (^). It is important to note that we have in essence just used the 
standard weak correlation assumptions of the cavity method in deriving (|12|l . In view of the 
good agreement with numerical simulations (Fig. 1), this strongly suggests that the cavity 
result is indeed exact in the thermodynamic limit. 

From an analytical point of view, it is intriguing that the present problem reveals a dif- 
ference in the scope of the replica and the cavity method. The latter can be transparently 
adapted to take into account that the cavity field is not Gaussian in the large deviations 
regime. But, commuting the thermal average with the disorder average, at the expense of 
considering moments, is part and parcel of using replicas. As a consequence, all the relevant 
fields become truly Gaussian. This points to implicit assumptions in the replica method, 
which need to be taken care of in any program to put the approach on a solid mathematical 
footing 0. 

* * * 
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(^)For g{x) = x*, the curve depends on whether (7^)7 > 3, since the fourth moment of a standard Gaussian 
is 3. If so, the value of r jumps from to 1 at Oc = l/((7*}7 — 3). The (7^)7 < 3 case, where one will 
use g{x) = — x*, shall be described elsewhere. It is much simpler since the large deviations regime does not 
contribute. 



